Dynamic Depth Decoding: Faster Speculative Decoding for LLMs
Oscar Brown / Zhengjie Wang / Andrea Do / Nikhil Mathew / Cheng Yu
ML Research Labs, Canberra, Australia
Australian National University
Abstract
The acceleration of Large Language Models (LLMs) with speculative decoding provides a significant runtime improvement without any loss of accuracy. Currently, EAGLE-2 is the state-of-the-art speculative decoding method, improving on EAGLE with a dynamic draft tree. We introduce Dynamic Depth Decoding (DDD), which optimises EAGLE-2’s tree drafting method using a dynamic depth. This extends the average speedup that EAGLE-2 achieves over EAGLE by 44%, giving DDD an average speedup of 3.16x.
Introduction
Large Language Models (LLMs) (Brown et al., 2020; Touvron et al., 2023) have demonstrated impressive performance across a variety of tasks. However, their large number of parameters makes inference too slow for many applications.
Speculative decoding (Leviathan et al., 2023) addresses this by accelerating an LLM, known as the target model. For each forward pass, the algorithm uses a much smaller draft model to generate a sequence of tokens, which is then input to the target model. A single run of the target model is sufficient both to verify the drafted tokens up to the first incorrect one and to generate the token that should follow the accepted prefix. This yields a speedup by producing multiple tokens per forward pass of the target model. Notably, speculative decoding methods are lossless, since every output token is verified as correct by the target model.
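The draft-then-verify loop above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `draft_next` and `target_next` are hypothetical stand-ins for real model calls, each mapping a token sequence to its greedily decoded next token.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """Draft k tokens with the small model, then verify them with the target.

    Returns the tokens accepted this step (always at least one, so progress
    is guaranteed). Every returned token matches what the target model alone
    would have produced, which is why the method is lossless.
    """
    # 1. Draft: k cheap autoregressive steps with the small draft model.
    drafted = []
    seq = list(prefix)
    for _ in range(k):
        tok = draft_next(seq)
        drafted.append(tok)
        seq.append(tok)

    # 2. Verify: conceptually a single target pass scores every drafted
    #    position at once (here unrolled as a loop for clarity).
    accepted = []
    seq = list(prefix)
    for tok in drafted:
        expected = target_next(seq)
        if tok != expected:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        seq.append(tok)

    # All drafts accepted; the same target pass also yields one bonus token.
    accepted.append(target_next(seq))
    return accepted
```

In the best case (a perfect draft model) each target forward pass yields k + 1 tokens instead of one, which is the source of the speedup.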
Extrapolation Algorithm for Greater Language-model Efficiency (EAGLE) (Li et al., 2024b) is a state-of-the-art speculative decoding method, its key feature being the construction of a draft model from the embedding layer and LM head of the target model with a single trainable head in between. On its first release, EAGLE generated a tree of tokens from the draft model and adjusted the target model's attention mask so that the entire tree could be input to the target model simultaneously. This tree has the structure shown in Figure 2, with the best tokens generated from each previous token placed on the left. Although the tree selects the tokens with the highest draft logprobs after each token, its structure is static, with no dependence on the draft model's output.
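The defining property of such a static tree is that its shape is fixed by a branching pattern chosen ahead of time, so the same node layout is reused for every forward pass. A minimal illustration (the branching numbers here are arbitrary, not EAGLE's actual configuration):

```python
from itertools import product

def enumerate_paths(branching):
    """List every root-to-leaf path of a static draft tree.

    `branching[d]` is the number of candidate tokens kept at depth d.
    Each path is a tuple of child ranks, where rank 0 is the best
    (highest-logprob) token at that position.
    """
    return list(product(*[range(b) for b in branching]))

# 3 candidates at depth 1, then 2 at depth 2: always the same 6 leaf
# paths, regardless of what the draft model actually outputs.
paths = enumerate_paths([3, 2])
```

A dynamic tree, by contrast, would let the model's logprobs decide which of these paths are worth expanding at all.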
EAGLE-2 (Li et al., 2024a) improves on this static tree method by introducing a dynamic draft tree. The tree is built with a beam search: after each run of the draft model, the top-k token sequences are chosen as the next input to the draft model, using the sum of all logprobs along a sequence as the ranking heuristic.
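One level of this beam-style expansion can be sketched as follows. This is a simplified sketch of the idea, not EAGLE-2's code; `expand` is a hypothetical stand-in for a draft-model call that returns candidate `(token, logprob)` pairs for a given sequence.

```python
import heapq

def expand_level(beams, expand, k):
    """Expand each beam one token and keep the top-k sequences.

    beams: list of (cum_logprob, token_sequence) pairs.
    Returns the k highest-scoring extended sequences, which form the
    next input to the draft model.
    """
    candidates = []
    for cum_lp, seq in beams:
        for tok, lp in expand(seq):
            # Heuristic score: sum of logprobs along the whole sequence.
            candidates.append((cum_lp + lp, seq + (tok,)))
    return heapq.nlargest(k, candidates, key=lambda c: c[0])
```

Because the score is cumulative, a sequence whose early tokens were low-confidence is naturally pruned even if its latest token scores well.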
Conclusion
In this work, we introduce Dynamic Depth Decoding, an optimisation of EAGLE-2's decoding algorithm that increases the speedup of the current state-of-the-art speculative decoding method. We identify an opportunity to use the draft model's confidence to decide whether to continue drafting. Since the heuristic check breaks lazy evaluation, we find it optimal to check the heuristic only a few times. We also compare our decoding algorithm to EAGLE and EAGLE-2 across a variety of models. Future work on speculative decoding that significantly improves on EAGLE-2's speedup will most likely focus on optimising the draft model and the verification process rather than the drafting algorithm.
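The early-stopping idea described above can be sketched as a small predicate. The confidence measure used here (the sum of the current beams' cumulative logprobs) and the fixed set of check depths are illustrative assumptions, not DDD's exact formulation:

```python
def should_continue(beams, depth, check_depths, threshold):
    """Decide whether to run another drafting step.

    beams: list of (cum_logprob, token_sequence) pairs at the current depth.
    Only evaluate the heuristic at a few fixed depths, since each check
    breaks lazy evaluation and therefore has a runtime cost of its own.
    """
    if depth not in check_depths:
        return True
    # Assumed confidence proxy: total cumulative logprob of the live beams.
    confidence = sum(lp for lp, _ in beams)
    return confidence > threshold
```

When the draft model becomes unconfident, further drafted tokens are unlikely to be accepted by the target model, so stopping early saves draft-model passes without hurting the acceptance rate.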